PARADISE: A Framework for Evaluating Spoken Dialogue Agents 



Marilyn A. Walker, Diane J. Litman, Candace A. Kamm and Alicia Abella 

AT&T Labs — Research 
180 Park Avenue
Florham Park, NJ 07932-0971 USA 
walker,diane,cak,abella@research.att.com



Abstract 

This paper presents PARADISE (PARAdigm 
for Dialogue System Evaluation), a general 
framework for evaluating spoken dialogue 
agents. The framework decouples task require- 
ments from an agent's dialogue behaviors, sup- 
ports comparisons among dialogue strategies, 
enables the calculation of performance over 
subdialogues and whole dialogues, specifies 
the relative contribution of various factors to 
performance, and makes it possible to compare 
agents performing different tasks by normaliz- 
ing for task complexity. 

1 Introduction 

Recent advances in dialogue modeling, speech recogni- 
tion, and natural language processing have made it pos- 
sible to build spoken dialogue agents for a wide vari- 
ety of applications. Potential benefits of such agents 
include remote or hands-free access, ease of use, natu- 
ralness, and greater efficiency of interaction. However, 
a critical obstacle to progress in this area is the lack of 
a general framework for evaluating and comparing the 
performance of different dialogue agents. 

One widely used approach to evaluation is based on
the notion of a reference answer (Hirschman et al., 1990).
An agent's responses to a query are compared with a
predefined key of minimum and maximum reference
answers; performance is the proportion of responses that
match the key. This approach has many widely
acknowledged limitations (Hirschman and Pao, 1993; Danieli
et al., 1992; Bates and Ayuso, 1993), e.g., although there
may be many potential dialogue strategies for carrying
out a task, the key is tied to one particular dialogue
strategy.

1 We use the term agent to emphasize the fact that we are
evaluating a speaking entity that may have a personality. Read- 
ers who wish to may substitute the word "system" wherever 
"agent" is used. 



In contrast, agents using different dialogue strategies 
can be compared with measures such as inappropri- 
ate utterance ratio, turn correction ratio, concept accu- 
racy, implicit recovery and transaction success (Danieli 
and Gerbino, 1995; Hirschman and Pao, 1993; Po- 
lifroni et al., 1992; Simpson and Fraser, 1993; Shriberg, 
Wade, and Price, 1992). Consider a comparison of two 
train timetable information agents (Danieli and Gerbino, 
1995), where Agent A in Dialogue 1 uses an explicit con- 
firmation strategy, while Agent B in Dialogue 2 uses an 
implicit confirmation strategy: 

(1) User: I want to go from Torino to Milano. 

Agent A: Do you want to go from Trento to Milano? 
Yes or No? 
User: No. 

(2) User: I want to travel from Torino to Milano. 
Agent B: At which time do you want to leave from 
Merano to Milano? 

User: No, I want to leave from Torino in the 
evening. 

Danieli and Gerbino found that Agent A had a higher
transaction success rate and produced fewer inappropriate
and repair utterances than Agent B, and thus concluded
that Agent A was more robust than Agent B.

However, one limitation of both this approach and the
reference answer approach is the inability to generalize
results to other tasks and environments (Fraser, 1995).
Such generalization requires the identification of factors
that affect performance (Cohen, 1995; Sparck-Jones and
Galliers, 1996). For example, while Danieli and Gerbino
found that Agent A's dialogue strategy produced dia- 
logues that were approximately twice as long as Agent 
B's, they had no way of determining whether Agent A's 
higher transaction success or Agent B's efficiency was 
more critical to performance. In addition to agent factors 
such as dialogue strategy, task factors such as database 
size and environmental factors such as background noise 
may also be relevant predictors of performance. 



These approaches are also limited in that they cur- 
rently do not calculate performance over subdialogues as 
well as whole dialogues, correlate performance with an 
external validation criterion, or normalize performance 
for task complexity. 




Figure 1: PARADISE's structure of objectives for spoken
dialogue performance

This paper describes PARADISE, a general frame- 
work for evaluating spoken dialogue agents that ad- 
dresses these limitations. PARADISE supports compar- 
isons among dialogue strategies by providing a task rep- 
resentation that decouples what an agent needs to achieve 
in terms of the task requirements from how the agent 
carries out the task via dialogue. PARADISE uses a 
decision-theoretic framework to specify the relative con- 
tribution of various factors to an agent's overall perfor- 
mance. Performance is modeled as a weighted function 
of a task-based success measure and dialogue-based cost 
measures, where weights are computed by correlating 
user satisfaction with performance. Also, performance 
can be calculated for subdialogues as well as whole di- 
alogues. Since the goal of this paper is to explain and 
illustrate the application of the PARADISE framework, 
for expository purposes, the paper uses simplified
domains with hypothetical data throughout. Section 2
describes PARADISE's performance model, and Section 3
discusses its generality, before concluding in Section 4.

2 A Performance Model for Dialogue 

PARADISE uses methods from decision theory (Keeney
and Raiffa, 1976; Doyle, 1992) to combine a disparate
set of performance measures (i.e., user satisfaction, task
success, and dialogue cost, all of which have been
previously noted in the literature) into a single performance
evaluation function. The use of decision theory requires
a specification of both the objectives of the decision
problem and a set of measures (known as attributes in
decision theory) for operationalizing the objectives. The
PARADISE model is based on the structure of objectives
(rectangles) shown in Figure 1. The PARADISE model
posits that performance can be correlated with a
meaningful external criterion such as usability, and thus that
the overall goal of a spoken dialogue agent is to
maximize an objective related to usability. User satisfaction
ratings (Kamm, 1995; Shriberg, Wade, and Price, 1992;
Polifroni et al., 1992) have been frequently used in the
literature as an external indicator of the usability of a
dialogue agent. The model further posits that two types
of factors are potential relevant contributors to user
satisfaction (namely task success and dialogue costs), and
that two types of factors are potential relevant
contributors to costs (Walker, 1996).



In addition to the use of decision theory to create this
objective structure, other novel aspects of PARADISE
include the use of the Kappa coefficient (Carletta, 1996;
Siegel and Castellan, 1988) to operationalize task
success, and the use of linear regression to quantify the
relative contribution of the success and cost factors to user
satisfaction.

The remainder of this section explains the measures
(ovals in Figure 1) used to operationalize the set of
objectives, and the methodology for estimating a quantitative
performance function that reflects the objective structure.
Section 2.1 describes PARADISE's task representation,
which is needed to calculate the task-based success
measure described in Section 2.2. Section 2.3 describes the
cost measures considered in PARADISE, which reflect
both the efficiency and the naturalness of an agent's
dialogue behaviors. Section 2.4 describes the use of linear
regression and user satisfaction to estimate the relative
contribution of the success and cost measures in a single
performance function. Finally, Section 2.5 explains
how performance can be calculated for subdialogues as
well as whole dialogues, while Section 2.6 summarizes
the method.



2.1 Tasks as Attribute Value Matrices 

A general evaluation framework requires a task
representation that decouples what an agent and user accomplish
from how the task is accomplished using dialogue
strategies. We propose that an attribute value matrix (AVM)
can represent many dialogue tasks. This consists of the
information that must be exchanged between the agent
and the user during the dialogue, represented as a set of
ordered pairs of attributes and their possible values.²

As a first illustrative example, consider a simplifica- 
tion of the train timetable domain of Dialogues 1 and 
2, where the timetable only contains information about 
rush-hour trains between four cities, as shown in Table 1.
This AVM consists of four attributes (abbreviations for
each attribute name are also shown).³ In Table 1, these
attribute-value pairs are annotated with the direction of 
information flow to represent who acquires the informa- 
tion, although this information is not used for evaluation. 
During the dialogue the agent must acquire from the user 
the values of DC, AC, and DR, while the user must ac- 
quire DT. 
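As a rough illustration of this representation, the following Python sketch encodes the AVM of the simplified timetable domain as a mapping from attributes to possible values, and compares a dialogue's AVM with a scenario key. The names `AVM`, `is_valid` and `mismatches` are hypothetical and not part of PARADISE; the second instance is an invented outcome in which the departure city was misunderstood.

```python
# AVM for the simplified train timetable domain:
# each attribute is paired with its set of possible values.
AVM = {
    "depart-city": {"Milano", "Roma", "Torino", "Trento"},
    "arrival-city": {"Milano", "Roma", "Torino", "Trento"},
    "depart-range": {"morning", "evening"},
    "depart-time": {"6am", "8am", "6pm", "8pm"},
}


def is_valid(instance):
    """True if the instance gives every attribute one of its possible values."""
    return instance.keys() == AVM.keys() and all(
        value in AVM[attr] for attr, value in instance.items()
    )


def mismatches(dialogue_avm, key_avm):
    """Attributes whose value in the dialogue differs from the scenario key."""
    return sorted(a for a in key_avm if dialogue_avm.get(a) != key_avm[a])


# Scenario key: an evening train from Torino to Milano.
key = {
    "depart-city": "Torino",
    "arrival-city": "Milano",
    "depart-range": "evening",
    "depart-time": "8pm",
}

# Invented dialogue outcome: the departure city was never corrected.
data = dict(key)
data["depart-city"] = "Trento"

print(mismatches(key, key))   # no differences from the key
print(mismatches(data, key))  # only depart-city differs
```

Because the comparison looks only at attribute values, two dialogues that reach the same values through different strategies are indistinguishable at this level, which is the decoupling the task representation is meant to provide.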



attribute            possible values                 information flow
depart-city (DC)     Milano, Roma, Torino, Trento    to agent
arrival-city (AC)    Milano, Roma, Torino, Trento    to agent
depart-range (DR)    morning, evening                to agent
depart-time (DT)     6am, 8am, 6pm, 8pm              to user

Table 1: Attribute value matrix, simplified train timetable
domain



B1:  Hello, This is Train Enquiry Service.         DC,AC,DR,DT
     Please speak after the tone.                  DC,AC,DR,DT
     Which information do you need?                DC,AC,DR,DT
U1:  I want to travel from Torino to Milano.       DC,AC
B2:  At which time do you want to leave
     from Merano to Milano?                        DC,AC,DR
U2:  No, I want to leave from Torino
     in the evening.                               DC,DR
B3:  Do you want to leave from Torino
     between 6 and 11 p.m?                         DC,DR
     Please answer Yes or No.                      DC,DR
U3:  Yes.                                          DC,DR
B4:  A train leaves at 8 p.m.                      DT

Figure 3: Agent B dialogue interaction (Danieli and
Gerbino, 1995)



attribute       actual value
depart-city     Torino
arrival-city    Milano
depart-range    evening
depart-time     8pm

Table 2: Attribute value matrix instantiation, scenario
key for Dialogues 1 and 2



A1:  Hello, This is Train Enquiry Service.         DC,AC,DR,DT
     Please speak after the tone.                  DC,AC,DR,DT
     Which information do you need?                DC,AC,DR,DT
U1:  I want to go from Torino to Milano.           DC,AC
A2:  Do you want to go from Trento to Milano?      DC,AC
     Yes or No?                                    DC,AC
U2:  No.                                           DC,AC
A3:  Do you want to leave from Trento?             DC
     Yes or No?                                    DC
U3:  No.                                           DC
A4:  Where do you want to leave from?              DC
U4:  I want to leave from Torino.                  DC
A5:  Do you want to leave from Torino?             DC
     Yes or No?                                    DC
U5:  Yes.                                          DC
A6:  Do you want to go to Milano?                  AC
U6:  Yes.                                          AC
A7:  At which time do you want to leave?           DR
U7:  I want to travel in the evening.              DR
A8:  Do you want to leave between 6 and 9 p.m.?    DR
     Yes or No?                                    DR
U8:  Yes.                                          DR
A9:  There is a train leaving at 8:00 p.m.         DT

Figure 2: Agent A dialogue interaction (Danieli and
Gerbino, 1995)

Performance evaluation for an agent requires a corpus
of dialogues between users and the agent, in which users
execute a set of scenarios. Each scenario execution has a
corresponding AVM instantiation indicating the task
information requirements for the scenario, where each
attribute is paired with the attribute value obtained via the
dialogue.

For example, assume that a scenario requires the user
to find a train from Torino to Milano that leaves in the
evening, as in the longer versions of Dialogues 1 and 2 in
Figures 2 and 3.⁴ Table 2 contains an AVM corresponding
to a "key" for this scenario. All dialogues resulting
from execution of this scenario in which the agent and
the user correctly convey all attribute values (as in
Figures 2 and 3) would have the same AVM as the scenario
key in Table 2. The AVMs of the remaining dialogues
would differ from the key by at least one value. Thus,
even though the dialogue strategies in Figures 2 and 3 are
radically different, the AVM task representation for these
dialogues is identical and the performance of the system
for the same task can thus be assessed on the basis of the
AVM representation.


2 For infinite sets of values, actual values found in the exper- 
imental data constitute the required finite set. 

3 The AVM serves as an evaluation mechanism only. We are 
not claiming that AVMs determine an agent's behavior or serve 
as an utterance's semantic representation. 



2.2 Measuring Task Success 

Success at the task for a whole dialogue (or subdi- 
alogue) is measured by how well the agent and user 
achieve the information requirements of the task by the 
end of the dialogue (or subdialogue). This section
explains how PARADISE uses the Kappa coefficient
(Carletta, 1996; Siegel and Castellan, 1988) to operationalize
the task-based success measure in Figure 1.

The Kappa coefficient, κ, is calculated from a confusion
matrix that summarizes how well an agent achieves
the information requirements of a particular task for a set


4 These dialogues have been slightly modified from (Danieli
and Gerbino, 1995). The attribute names at the end of each
utterance will be explained below.





KEY 




DEPART-CITY 


ARRIVAL-CITY 


DEPART-RANGE 


DEPART-TIME 


DATA 


v1 v2 v3 v4


v5 v6 v7 v8


v9 v10


v11 v12 v13 v14


v1
v2 
v3 
v4 


22 1 
29 

4 16 4 
1 1 5 11 


3 

1 
1 






v5 
v6 
v7 
v8 


3 

2 
1 


20 

22 

1 1 20 5 
1 2 8 15 






v9 
v10






45 10 
5 40 




v11
v12
v13
v14








20 2 

1 19 2 4 

2 18 

2 6 3 21 


sum 


30 30 25 15 


25 25 30 20 


50 50 


25 25 25 25 



Table 3: Confusion matrix, Agent A 





KEY 




DEPART-CITY 


ARRIVAL-CITY 


DEPART-RANGE 


DEPART-TIME 


DATA 


v1 v2 v3 v4


v5 v6 v7 v8


v9 v10


v11 v12 v13 v14


v1
v2 
v3 
v4 


16 1 
1 20 1 
5 19 4 
12 6 6 


4 

3 

2 4 2 
2 3 


3 2 




v5 
v6 
v7 
v8 


4 

1 6 

5 2 
1 3 3 


15 

19 

1 1 15 4 
1 2 9 11 


2 3 




v9 
v10


2 


2 


39 10 
6 35 




v11
v12
v13
v14








20 5 5 4 
10 5 5 
5 5 10 5 
5 5 11 


sum 


30 30 25 15 


25 25 30 20 


50 50 


25 25 25 25 



Table 4: Confusion matrix, Agent B 



of dialogues instantiating a set of scenarios.⁵ For example,
Tables 3 and 4 show two hypothetical confusion
matrices that could have been generated in an evaluation of
100 complete dialogues with each of two train timetable
agents A and B (perhaps using the confirmation
strategies illustrated in Figures 2 and 3, respectively).⁶ The
values in the matrix cells are based on comparisons
between the dialogue and scenario key AVMs. Whenever
an attribute value in a dialogue (i.e., data) AVM matches
the value in its scenario key, the number in the
appropriate diagonal cell of the matrix (boldface for clarity)
is incremented by 1. The off-diagonal cells represent
misunderstandings that are not corrected in the dialogue.
Note that depending on the strategy that a spoken
dialogue agent uses, confusions across attributes are
possible, e.g., "Milano" could be confused with "morning."
The effect of misunderstandings that are corrected during
the course of the dialogue is reflected in the costs
associated with the dialogue, as will be discussed below.

5 Confusion matrices can be constructed to summarize the
result of dialogues for any subset of the scenarios, attributes,
users or dialogues.

6 The distributions in the tables were roughly based on
performance results in (Danieli and Gerbino, 1995).

The first matrix summarizes how the 100 AVMs rep- 
resenting each dialogue with Agent A compare with 
the AVMs representing the relevant scenario keys, while 
the second matrix summarizes the information exchange 
with Agent B. Labels v1 to v4 in each matrix represent
the possible values of depart-city shown in Table 1; v5
to v8 are for arrival-city, etc. Columns represent the key,
specifying which information values the agent and user
were supposed to communicate to one another given a
particular scenario. (The equivalent column sums in both
tables reflect that users of both agents were assumed to
have performed the same scenarios.) Rows represent the
data collected from the dialogue corpus, reflecting what 
attribute values were actually communicated between the 
agent and the user. 

Given a confusion matrix M, success at achieving the
information requirements of the task is measured with
the Kappa coefficient (Carletta, 1996; Siegel and
Castellan, 1988):

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$



P(A) is the proportion of times that the AVMs for the
actual set of dialogues agree with the AVMs for the
scenario keys, and P(E) is the proportion of times that the
AVMs for the dialogues and the keys are expected to
agree by chance.⁷ When there is no agreement other than
that which would be expected by chance, κ = 0. When
there is total agreement, κ = 1. κ is superior to other
measures of success such as transaction success (Danieli
and Gerbino, 1995), concept accuracy (Simpson and
Fraser, 1993), and percent agreement (Gale, Church, and
Yarowsky, 1992) because κ takes into account the
inherent complexity of the task by correcting for chance
expected agreement. Thus κ provides a basis for
comparisons across agents that are performing different tasks.
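The calculation can be sketched in Python as follows. This is a minimal illustration, assuming that P(E) is estimated from the key as the sum of squared column proportions of the confusion matrix and that P(A) is the proportion of counts on the diagonal. The function names are hypothetical. The 2x2 matrix is the depart-range block of Table 3, the column totals are the "sum" row shared by Tables 3 and 4, and 0.795 is the P(A) reported for Agent A.

```python
def chance_agreement(column_sums):
    """P(E): sum over key categories of (t_i / T) squared."""
    total = sum(column_sums)
    return sum((t / total) ** 2 for t in column_sums)


def kappa(p_a, p_e):
    """Kappa from observed agreement P(A) and chance agreement P(E)."""
    return (p_a - p_e) / (1 - p_e)


def kappa_from_matrix(m):
    """m[r][c] counts data value r against key value c (square matrix)."""
    size = len(m)
    total = sum(sum(row) for row in m)
    p_a = sum(m[i][i] for i in range(size)) / total
    column_sums = [sum(row[c] for row in m) for c in range(size)]
    return kappa(p_a, chance_agreement(column_sums))


# Depart-range block of Table 3 (rows: data v9, v10; columns: key v9, v10).
depart_range = [[45, 10], [5, 40]]
print(round(kappa_from_matrix(depart_range), 3))  # 0.7

# Column totals of the full 14-value matrix in Tables 3 and 4.
column_totals = [30, 30, 25, 15, 25, 25, 30, 20, 50, 50, 25, 25, 25, 25]
p_e = chance_agreement(column_totals)
print(round(p_e, 3), round(kappa(0.795, p_e), 3))  # 0.079 0.777
```

Passing the whole matrix rather than per-attribute blocks lowers P(E), which is the effect footnote 8 cautions about.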

When the prior distribution of the categories is
unknown, P(E), the expected chance agreement between
the data and the key, can be estimated from the
distribution of the values in the keys. This can be calculated
from confusion matrix M, since the columns represent
the values in the keys. In particular:

$$P(E) = \sum_{i=1}^{n} \left(\frac{t_i}{T}\right)^2$$

where tᵢ is the sum of the frequencies in column i of M,
and T is the sum of the frequencies in M (t₁ + ... + tₙ).

P(A), the actual agreement between the data and the
key, is always computed from the confusion matrix M:

$$P(A) = \frac{\sum_{i=1}^{n} M(i,i)}{T}$$

Given the confusion matrices in Tables 3 and 4, P(E)
= 0.079 for both agents.⁸ For Agent A, P(A) = 0.795
and κ = 0.777, while for Agent B, P(A) = 0.59 and κ =
0.555, suggesting that Agent A is more successful than
B in achieving the task goals.

2.3 Measuring Dialogue Costs

As shown in Figure 1, performance is also a function of a
combination of cost measures. Intuitively, cost measures
should be calculated on the basis of any user or agent
dialogue behaviors that should be minimized. A wide
range of cost measures have been used in previous work;
these include pure efficiency measures such as the
number of turns or elapsed time to complete the task (Abella,
Brown, and Buntschuh, 1996; Hirschman et al., 1990;
Smith and Gordon, 1997; Walker, 1996), as well as
measures of qualitative phenomena such as inappropriate or
repair utterances (Danieli and Gerbino, 1995; Hirschman
and Pao, 1993; Simpson and Fraser, 1993).

PARADISE represents each cost measure as a
function cᵢ that can be applied to any (sub)dialogue. First,
consider the simplest case of calculating efficiency
measures over a whole dialogue. For example, let c₁ be the
total number of utterances. For the whole dialogue D1 in
Figure 2, c₁(D1) is 23 utterances. For the whole dialogue
D2 in Figure 3, c₁(D2) is 10 utterances.

To calculate costs over subdialogues and for some
of the qualitative measures, it is necessary to be able
to specify which information goals each utterance
contributes to. PARADISE uses its AVM representation to
link the information goals of the task to any arbitrary
dialogue behavior, by tagging the dialogue with the
attributes for the task.⁹ This makes it possible to evaluate
any potential dialogue strategies for achieving the task,
as well as to evaluate dialogue strategies that operate at
the level of dialogue subtasks (subdialogues).
7 κ has been used to measure pairwise agreement among
coders making category judgments (Carletta, 1996;
Krippendorf, 1980; Siegel and Castellan, 1988). Thus, the observed
user/agent interactions are modeled as a coder, and the ideal
interactions as an expert coder.

8 Using a single confusion matrix for all attributes as in
Tables 3 and 4 inflates κ when there are few cross-attribute
confusions by making P(E) smaller. In some cases it might be
desirable to calculate κ first for identification of attributes and then
for values within attributes, or to average κ for each attribute to
produce an overall κ for the task.





[Figure 4: nested discourse segments labeled with goals and
utterances, including S2 (GOALS: DC, AC; UTTERANCES: U1...U6),
a segment with GOALS: DR, and S6 (GOALS: DT; UTTERANCES: A9).]

Figure 4: Task-defined discourse structure of Agent A
dialogue interaction



9 This tagging can be hand generated, or system generated
and hand corrected. Preliminary studies indicate that reliability
for human tagging is higher for AVM attribute tagging than
for other types of discourse segment tagging (Passonneau and
Litman, 1997; Hirschberg and Nakatani,



Consider the longer versions of Dialogues 1 and 2 in 
Figures 2 and 3. Each utterance in Figures 2 and 3 has 
been tagged using one or more of the attribute abbre- 
viations in Table 1, according to the subtask(s) the ut- 
terance contributes to. As a convention of this type of 
tagging, utterances that contribute to the success of the 
whole dialogue, such as greetings, are tagged with all the 
attributes. Since the structure of a dialogue reflects the
structure of the task (Carberry, 1989; Grosz and Sidner,
1986; Litman and Allen, 1990), the tagging of a dialogue
by the AVM attributes can be used to generate a
hierarchical discourse structure such as that shown in Figure 4
for Dialogue 1 (Figure 2). For example, segment
(subdialogue) S2 in Figure 4 is about both depart-city (DC) and
arrival-city (AC). It contains segments S3 and S4 within
it, and consists of utterances U1 ... U6.

Tagging by AVM attributes is required to calculate 
costs over subdialogues, since for any subdialogue, task 
attributes define the subdialogue. For subdialogue S4 in
Figure 4, which is about the attribute arrival-city and
consists of utterances A6 and U6, c₁(S4) is 2.

Tagging by AVM attributes is also required to
calculate the cost of some of the qualitative measures, such as
number of repair utterances. (Note that to calculate such
costs, each utterance in the corpus of dialogues must also
be tagged with respect to the qualitative phenomenon in
question, e.g. whether the utterance is a repair.¹⁰) For
example, let c₂ be the number of repair utterances. The
repair utterances in Figure 2 are A3 through U6, thus
c₂(D1) is 10 utterances and c₂(S4) is 2 utterances. The
repair utterance in Figure 3 is U2, but note that according
to the AVM task tagging, U2 simultaneously addresses
the information goals for depart-range. In general, if an
utterance U contributes to the information goals of N
different attributes, each attribute accounts for 1/N of any
costs derivable from U. Thus, c₂(D2) is .5.
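The two cost functions in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not PARADISE's implementation: the `Utterance` record and its `repairs` field are hypothetical, and the rule that a repair utterance tagged with N attributes contributes 1/N for each attribute it repairs is one reading of the text. The tags are those of the Agent B dialogue (D2).

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    tags: tuple          # AVM attributes the utterance contributes to
    repairs: tuple = ()  # hypothetical: the subset of tags being repaired


ALL = ("DC", "AC", "DR", "DT")

# Dialogue D2 (Agent B), one record per tagged utterance.
D2 = [
    Utterance("B1", ALL),
    Utterance("B1", ALL),
    Utterance("B1", ALL),
    Utterance("U1", ("DC", "AC")),
    Utterance("B2", ("DC", "AC", "DR")),
    Utterance("U2", ("DC", "DR"), repairs=("DC",)),
    Utterance("B3", ("DC", "DR")),
    Utterance("B3", ("DC", "DR")),
    Utterance("U3", ("DC", "DR")),
    Utterance("B4", ("DT",)),
]


def c1(dialogue):
    """Efficiency cost: total number of utterances."""
    return len(dialogue)


def c2(dialogue):
    """Repair cost: each repaired attribute accounts for 1/N of an
    utterance that is tagged with N attributes."""
    return sum(len(u.repairs) / len(u.tags) for u in dialogue)


print(c1(D2), c2(D2))  # 10 0.5
```

Restricting either function to the utterances carrying a given attribute tag would give the corresponding subdialogue cost.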

Given a set of cᵢ, it is necessary to combine the
different cost measures in order to determine their relative
contribution to performance. The next section explains
how to combine κ with a set of cᵢ to yield an overall
performance measure.

2.4 Estimating a Performance Function 

Given the definition of success and costs above and the
model in Figure 1, performance for any (sub)dialogue D
is defined as follows:¹¹

10 Previous work has shown that this can be done with high
reliability (Hirschman and Pao, 1992).

11 We assume an additive performance (utility) function
because it appears that κ and the various cost factors cᵢ are
utility independent and additive independent (Keeney and Raiffa,
1976). It is possible however that user satisfaction data
collected in future experiments (or other data such as willingness
to pay or use) would indicate otherwise. If so, continuing use of
an additive function might require a transformation of the data,



$$\text{Performance} = (\alpha * \mathcal{N}(\kappa)) - \sum_{i=1}^{n} w_i * \mathcal{N}(c_i)$$



Here α is a weight on κ, the cost functions cᵢ are
weighted by wᵢ, and 𝒩 is a Z score normalization
function (Cohen, 1995).

The normalization function is used to overcome the
problem that the values of cᵢ are not on the same scale as
κ, and that the cost measures cᵢ may also be calculated
over widely varying scales (e.g. response delay could
be measured using seconds while, in the example, costs
were calculated in terms of number of utterances). This
problem is easily solved by normalizing each factor x to
its Z score:

$$\mathcal{N}(x) = \frac{x - \bar{x}}{\sigma_x}$$

where x̄ is the mean of x and σₓ is the standard deviation for x.



user      agent   US     κ      c₁ (#utt)   c₂ (#rep)
1         A       1      1      46          30
2         A       2      1      50          30
3         A       2      1      52          30
4         A       3      1      40          20
5         A       4      1      23          10
6         A       2      1      50          36
7         A       1      0.46   75          30
8         A       1      0.19   60          30
9         B       6      1      8           0
10        B       5      1      15          1
11        B       6      1      10          0.5
12        B       5      1      20          3
13        B       1      0.19   45          18
14        B       1      0.46   50          22
15        B       2      0.19   34          18
16        B       2      0.46   40          18
Mean(A)   A       2      0.83   49.5        27
Mean(B)   B       3.5    0.66   27.8        10.1
Mean      NA      2.75   0.75   38.6        18.5

Table 5: Hypothetical performance data from users of
Agents A and B

To illustrate the method for estimating a performance
function, we will use a subset of the data from Tables
3 and 4, shown in Table 5. Table 5 represents the
results from a hypothetical experiment in which eight users
were randomly assigned to communicate with Agent A
and eight users were randomly assigned to communicate
with Agent B. Table 5 shows user satisfaction (US)
ratings (discussed below), κ, number of utterances (#utt)
and number of repair utterances (#rep) for each of these
users. Users 5 and 11 correspond to the dialogues in
Figures 2 and 3 respectively. To normalize c₁ for user 5, we
determine that the mean of c₁ is 38.6 and its standard
deviation is 18.9. Thus, 𝒩(c₁) is -0.83. Similarly 𝒩(c₁)
for user 11 is -1.51.
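The normalization step can be checked with a few lines of Python. This is a small sketch using the sixteen #utt values of Table 5; the use of the sample (n - 1) standard deviation is an assumption, chosen because it gives the 18.9 quoted above on these data. The name `z_scores` is hypothetical.

```python
from statistics import mean, stdev

# Number of utterances (#utt) for users 1..16 in Table 5.
utterances = [46, 50, 52, 40, 23, 50, 75, 60, 8, 15, 10, 20, 45, 50, 34, 40]


def z_scores(values):
    """Normalize each value to its Z score (sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


z = z_scores(utterances)
print(round(mean(utterances), 1), round(stdev(utterances), 1))  # 38.6 18.9
print(round(z[4], 2), round(z[10], 2))  # users 5 and 11: -0.83 -1.51
```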

To estimate the performance function, the weights α
and wᵢ must be solved for. Recall that the claim implicit
a reworking of the model shown in Figure 1, or the inclusion of
interaction terms in the model (Cohen, 1995).



in Figure 1 was that the relative contribution of task
success and dialogue costs to performance should be
calculated by considering their contribution to user
satisfaction. User satisfaction is typically calculated with
surveys that ask users to specify the degree to which they
agree with one or more statements about the behavior or
the performance of the system. A single user satisfaction
measure can be calculated from a single question, or as
the mean of a set of ratings. The hypothetical user
satisfaction ratings shown in Table 5 range from a high of 6
to a low of 1.

Given a set of dialogues for which user satisfaction
(US), κ and the set of cᵢ have been collected
experimentally, the weights α and wᵢ can be solved for using
multiple linear regression. Multiple linear regression produces
a set of coefficients (weights) describing the relative
contribution of each predictor factor in accounting for the
variance in a predicted factor. In this case, on the basis
of the model in Figure 1, US is treated as the predicted
factor. Normalization of the predictor factors (κ and cᵢ)
to their Z scores guarantees that the relative magnitude
of the coefficients directly indicates the relative
contribution of each factor. Regression on the Table 5 data for
both sets of users tests which of the factors κ, #utt and
#rep most strongly predict US.

In this illustrative example, the results of the
regression with all factors included show that only κ and #rep
are significant (p < .02). In order to develop a
performance function estimate that includes only significant
factors and eliminates redundancies, a second regression
including only significant factors must then be done. In
this case, a second regression yields the predictive
equation:

Performance = .40 𝒩(κ) - .78 𝒩(c₂)

i.e., α is .40 and w₂ is .78. The results also show κ is
significant at p < .0003, #rep significant at p < .0001,
and the combination of κ and #rep accounts for 92% of
the variance in US, the external validation criterion. The
factor #utt was not a significant predictor of performance,
in part because #utt and #rep are highly redundant. (The
correlation between #utt and #rep is 0.91).
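The second regression can be sketched in Python without external libraries. This is a minimal illustration, not the procedure actually used: it assumes that US is normalized to Z scores along with the predictors (the text does not say so explicitly) and solves the two-predictor least-squares problem directly from the normal equations. The helper names `z`, `dot` and `fit` are hypothetical. On the Table 5 data the resulting weights come out close to the .40 and .78 reported above.

```python
from statistics import mean, stdev

# Table 5, users 1..16: user satisfaction, kappa, repair utterances.
US = [1, 2, 2, 3, 4, 2, 1, 1, 6, 5, 6, 5, 1, 1, 2, 2]
KAPPA = [1, 1, 1, 1, 1, 1, 0.46, 0.19, 1, 1, 1, 1, 0.19, 0.46, 0.19, 0.46]
REP = [30, 30, 30, 20, 10, 36, 30, 30, 0, 1, 0.5, 3, 18, 22, 18, 18]


def z(values):
    """Z score normalization with the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit(y, x1, x2):
    """Least-squares weights for y ~ b1*x1 + b2*x2.
    No intercept is needed because all inputs are centred."""
    a11, a12, a22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    r1, r2 = dot(x1, y), dot(x2, y)
    det = a11 * a22 - a12 * a12
    return (r1 * a22 - r2 * a12) / det, (r2 * a11 - r1 * a12) / det


b_kappa, b_rep = fit(z(US), z(KAPPA), z(REP))
alpha, w2 = b_kappa, -b_rep  # costs enter the performance function negatively
print(round(alpha, 3), round(w2, 3))
```

Significance testing of the coefficients is not shown; a statistics package would normally be used for that step.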

Given these predictions about the relative contribution
of different factors to performance, it is then possible
to return to the problem first introduced in Section 1:
given potentially conflicting performance criteria such as
robustness and efficiency, how can the performance of
Agent A and Agent B be compared? Given values for
α and wᵢ, performance can be calculated for both agents
using the equation above. The mean performance of A
is -.44 and the mean performance of B is .44, suggesting
that Agent B may perform better than Agent A overall.
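The comparison can be reproduced with a short Python sketch, assuming the weights .40 and .78 from the predictive equation and Z scores computed over all sixteen users of Table 5 with the sample standard deviation. The helper names are hypothetical.

```python
from statistics import mean, stdev

# Table 5, users 1..16 (users 1-8 used Agent A, users 9-16 Agent B).
AGENT = ["A"] * 8 + ["B"] * 8
KAPPA = [1, 1, 1, 1, 1, 1, 0.46, 0.19, 1, 1, 1, 1, 0.19, 0.46, 0.19, 0.46]
REP = [30, 30, 30, 20, 10, 36, 30, 30, 0, 1, 0.5, 3, 18, 22, 18, 18]

ALPHA, W2 = 0.40, 0.78  # weights from the predictive equation


def z(values):
    """Z score normalization with the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Performance of each dialogue: weighted success minus weighted cost.
performance = [ALPHA * k - W2 * r for k, r in zip(z(KAPPA), z(REP))]


def mean_performance(agent):
    return mean([p for p, a in zip(performance, AGENT) if a == agent])


mean_a, mean_b = mean_performance("A"), mean_performance("B")
print(round(mean_a, 2), round(mean_b, 2))  # -0.44 0.44
```

The two means are mirror images because the Z scores sum to zero over the pooled data and the two groups are the same size.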

The evaluator must then however test these
performance differences for statistical significance. In this
case, a t test shows that differences are only significant
at the p < .07 level, indicating a trend only. An
evaluation over a larger subset of the user population
would probably show significant differences.

2.5 Application to Subdialogues 

Since both κ and cᵢ can be calculated over subdialogues,
performance can also be calculated at the subdialogue
level by using the values for α and wᵢ as solved for
above. This assumes that the factors that are predictive of
global performance, based on US, generalize as
predictors of local performance, i.e. within subdialogues
defined by subtasks, as defined by the attribute tagging.¹²

Consider calculating the performance of the dialogue 
strategies used by train timetable Agents A and B, over 
the subdialogues that repair the value of depart-city. Seg- 
ment S3 (Figure 4) is an example of such a subdialogue 
with Agent A. As in the initial estimation of a
performance function, our analysis requires experimental data,
namely a set of values for κ and cᵢ, and the application
of the Z score normalization function to this data.
However, the values for κ and cᵢ are now calculated at the
subdialogue rather than the whole dialogue level. In
addition, only data from comparable strategies can be used
to calculate the mean and standard deviation for normal- 
ization. Informally, a comparable strategy is one which 
applies in the same state and has the same effects. 

For example, to calculate κ for Agent A over the
subdialogues that repair depart-city, P(A) and P(E) are
computed using only the subpart of Table 3 concerned with
depart-city. For Agent A, P(A) = .78, P(E) = .265, and
κ = .70. Then, this value of κ is normalized using data
from comparable subdialogues with both Agent A and
Agent B. Based on the data in Tables 3 and 4, the mean
κ is .515 and σ is .261, so that 𝒩(κ) for Agent A is .71.

To calculate c₂ for Agent A, assume that the average
number of repair utterances for Agent A's subdialogues
that repair depart-city is 6, that the mean over all
comparable repair subdialogues is 4, and the standard deviation
is 2.79. Then 𝒩(c₂) is .72.

Let Agent A's repair dialogue strategy for
subdialogues repairing depart-city be R_A and Agent B's repair
strategy for depart-city be R_B. Then using the
performance equation above, predicted performance for R_A is:

Performance(R_A) = .40 * .71 - .78 * .72 = -0.28

For Agent B, using the appropriate subpart of Table
4 to calculate κ, assuming that the average number of
depart-city repair utterances is 1.38, and using similar
calculations, yields

Performance(R_B) = .40 * -.71 - .78 * -.94 = 0.45

Thus the results of these experiments predict that when
an agent needs to choose between the repair strategy that
Agent B uses and the repair strategy that Agent A uses
for repairing depart-city, it should use Agent B's strategy
R_B, since Performance(R_B) is predicted to be greater
than Performance(R_A).

(The assumption that predictors of global performance
generalize to subdialogues has a sound basis in theories of
dialogue structure (Carberry, 1989; Grosz and Sidner, 1986;
Litman and Allen, 1990), but should be tested empirically.)
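
The comparison of the two repair strategies can be sketched as follows. This is an illustration only, assuming the performance function has the form α * N(κ) - Σ w_i * N(c_i) with a single cost term c_2 (repair utterances), and taking the weights .40 and .78 from the equations in the text; the names `performance` and `strategies` are hypothetical.

```python
def z_score(value, mean, std_dev):
    """Z score normalization N(x) = (x - mean) / std_dev."""
    return (value - mean) / std_dev


def performance(norm_kappa, norm_costs, alpha, weights):
    """alpha * N(kappa) minus the weighted sum of normalized costs."""
    return alpha * norm_kappa - sum(w * c for w, c in zip(weights, norm_costs))


ALPHA = 0.40      # weight on N(kappa), as used in the equations in the text
W_REPAIR = 0.78   # weight on N(c_2), the repair-utterance cost

# Mean and standard deviation over all comparable repair subdialogues.
MEAN_REPAIRS, SD_REPAIRS = 4.0, 2.79

strategies = {
    "R_A": {"norm_kappa": 0.71, "avg_repairs": 6.0},
    "R_B": {"norm_kappa": -0.71, "avg_repairs": 1.38},
}

scores = {}
for name, data in strategies.items():
    norm_c2 = z_score(data["avg_repairs"], MEAN_REPAIRS, SD_REPAIRS)
    raw = performance(data["norm_kappa"], [norm_c2], ALPHA, [W_REPAIR])
    scores[name] = round(raw, 2)

# The strategy with the higher predicted performance is preferred.
best = max(scores, key=scores.get)
print(scores, best)
```

Because the cost term is not rounded before it is weighted, intermediate values differ slightly from the hand calculation, but the rounded results agree with the figures in the text.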

Note that the ability to calculate performance over 
subdialogues allows us to conduct experiments that si- 
multaneously test multiple dialogue strategies. For ex- 
ample, suppose Agents A and B had different strate- 
gies for presenting the value of depart-time (in addition 
to different confirmation strategies). Without the abil- 
ity to calculate performance over subdialogues, it would 
be impossible to test the effect of the different presen- 
tation strategies independently of the different confirma- 
tion strategies. 

2.6 Summary 

We have presented the PARADISE framework, and have 
used it to evaluate two hypothetical dialogue agents in a 
simplified train timetable task domain. We used PAR- 
ADISE to derive a performance function for this task, by 
estimating the relative contribution of a set of potential 
predictors to user satisfaction. The PARADISE method- 
ology consists of the following steps: 

• definition of a task and a set of scenarios; 

• specification of the AVM task representation; 

• experiments with alternate dialogue agents for the 
task; 

• calculation of user satisfaction using surveys; 

• calculation of task success using κ;

• calculation of dialogue cost using efficiency and 
qualitative measures; 

• estimation of a performance function using linear
regression and values for user satisfaction, κ and
dialogue costs;

• comparison with other agents/tasks to determine 
which factors generalize; 

• refinement of the performance model. 

Note that all of these steps are required to develop the
performance function. However, once the weights in the
performance function have been solved for, user satisfaction
ratings no longer need to be collected. Instead, predictions
about user satisfaction can be made on the basis
of the predictor variables, as illustrated in the application
of PARADISE to subdialogues.
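
The estimation and prediction steps can be illustrated with a small regression sketch. Everything below is an assumption made for illustration: the data are invented and noise-free (generated from the weights .40 and .78 so that the fit recovers them), user satisfaction is taken to be mean-centered so that no intercept is needed, and `fit_two_predictors` is a hypothetical helper that solves the 2x2 normal equations directly rather than calling a statistics package.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y ~ b1*x1 + b2*x2 (no intercept), via normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * t for a, t in zip(x1, y))
    s2y = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are collinear")
    # Cramer's rule on the 2x2 system.
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Invented per-dialogue values of N(kappa) and N(c), for illustration only.
norm_kappa = [1.2, -0.5, 0.3, -1.0, 0.8, -0.8]
norm_cost = [-0.9, 1.1, 0.2, 0.7, -1.3, 0.2]
# Synthetic user satisfaction generated from known weights (not real survey data).
user_satisfaction = [0.40 * k - 0.78 * c for k, c in zip(norm_kappa, norm_cost)]

alpha, cost_coefficient = fit_two_predictors(norm_kappa, norm_cost, user_satisfaction)
w = -cost_coefficient  # costs enter the performance function with a minus sign


def predict_performance(nk, nc):
    """Predict performance from the predictors alone, without new survey data."""
    return alpha * nk - w * nc


predicted = round(predict_performance(0.71, 0.72), 2)
print(round(alpha, 2), round(w, 2), predicted)
```

With real data the fit would not be exact, and the significance of each predictor would need to be checked before keeping it in the performance function.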

Given the current state of knowledge, it is important to 
emphasize that researchers should be cautious about gen- 
eralizing a derived performance function to other agents 
or tasks. Performance function estimation should be 
done iteratively over many different tasks and dialogue 
strategies to see which factors generalize. In this way, 
the field can make progress on identifying the relation- 
ship between various factors and can move towards more 
predictive models of spoken dialogue agent performance. 

3 Generality 

In the previous section we used PARADISE to eval- 
uate two confirmation strategies, using as examples 
fairly simple information access dialogues in the train 
timetable domain. In this section we demonstrate that 
PARADISE is applicable to a range of tasks, domains, 
and dialogues, by presenting AVMs for two tasks involv- 
ing more than information access, and showing how ad- 
ditional dialogue phenomena can be tagged using AVM 
attributes. 



attribute            possible values                information flow
depart-city (DC)     Milano, Roma, Torino, Trento   to agent
arrival-city (AC)    Milano, Roma, Torino, Trento   to agent
depart-range (DR)    morning, evening               to agent
depart-time (DT)     6am, 8am, 6pm, 8pm             to user
request-type (RT)    reserve, purchase              to agent

Table 6: Attribute value matrix, train timetable domain
with requests



First, consider an extension of the train timetable task, 
where an agent can handle requests to reserve a seat or 
purchase a ticket. This task could be represented using 
the AVM in Table 6 (an extension of Table 1), where the 
agent must now acquire the value of the attribute request- 
type, in order to know what to do with the other informa- 
tion it has acquired. 
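
One way to picture the AVM in Table 6 is as a small lookup structure against which an instantiation for a particular dialogue can be checked. The sketch below is illustrative only: the dictionary layout and the helper `invalid_attributes` are assumptions, not part of PARADISE, and the sample instantiation is the one suggested by the Agent C dialogue in Figure 5.

```python
CITIES = {"Milano", "Roma", "Torino", "Trento"}

# Table 6 as a mapping: attribute -> allowed values and direction of information flow.
AVM_TABLE_6 = {
    "depart-city": {"values": CITIES, "flow": "to agent"},
    "arrival-city": {"values": CITIES, "flow": "to agent"},
    "depart-range": {"values": {"morning", "evening"}, "flow": "to agent"},
    "depart-time": {"values": {"6am", "8am", "6pm", "8pm"}, "flow": "to user"},
    "request-type": {"values": {"reserve", "purchase"}, "flow": "to agent"},
}


def invalid_attributes(instantiation):
    """Return the attributes that are unknown or carry a value outside the AVM."""
    return sorted(
        name
        for name, value in instantiation.items()
        if name not in AVM_TABLE_6 or value not in AVM_TABLE_6[name]["values"]
    )


# Instantiation for a dialogue in which a seat is reserved on the 8am train.
instantiation = {
    "depart-city": "Torino",
    "arrival-city": "Roma",
    "depart-range": "morning",
    "depart-time": "8am",
    "request-type": "reserve",
}

# Attributes whose values the agent must acquire from the user.
acquired_by_agent = sorted(
    name for name, spec in AVM_TABLE_6.items() if spec["flow"] == "to agent"
)
print(invalid_attributes(instantiation), acquired_by_agent)
```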



U1: I want to go from Torino to Roma                           DC,AC
C1: Approximately what time of day would you like to travel?   DR
U2: What are the options?                                      DR
C2: Morning or evening.                                        DR
U3: Are those departure times?                                 DR
C3: Yes.                                                       DR
U4: I'd like to leave in the morning.                          DR
C4: Train 702 leaves Torino Porto at 8 a.m.                    DT
U5: Please reserve me a seat on that train.                    RT

Figure 5: Hypothetical Agent C dialogue interaction



Figure 5 presents a hypothetical dialogue in this extended
task domain, and illustrates user utterance types
and an agent dialogue strategy that are very different
from those in Figures 2 and 3. First, Agent C in Figure 5
uses a "no confirmation" dialogue strategy, in contrast to
the explicit and implicit confirmation strategies used in
Figures 2 and 3. Second, Figure 5 illustrates new types
of user utterances that do not directly further the informational
goals of the task. In U2, the user asks the agent
a wh-question about the DR attribute itself, rather than
providing information about that attribute's value. Since
U2 satisfies a knowledge precondition related to answering
C1, U2 contributes to the DR goal and is tagged as
such. In U3, the user similarly asks a yes-no question that
addresses a subgoal related to answering C1. Finally, U5
illustrates a user request for an agent action, and is tagged
with the RT attribute. The value of RT in the AVM
instantiation for the dialogue would be "reserve."
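
The attribute tagging of Figure 5 can be represented directly as data, which makes it easy to count utterances per attribute or to recover the subdialogue segments. This is a minimal sketch; the tuple layout and variable names are illustrative choices, not part of the framework.

```python
from collections import Counter
from itertools import groupby

# (turn, utterance, attribute tags), transcribed from Figure 5.
DIALOGUE = [
    ("U1", "I want to go from Torino to Roma", ("DC", "AC")),
    ("C1", "Approximately what time of day would you like to travel?", ("DR",)),
    ("U2", "What are the options?", ("DR",)),
    ("C2", "Morning or evening.", ("DR",)),
    ("U3", "Are those departure times?", ("DR",)),
    ("C3", "Yes.", ("DR",)),
    ("U4", "I'd like to leave in the morning.", ("DR",)),
    ("C4", "Train 702 leaves Torino Porto at 8 a.m.", ("DT",)),
    ("U5", "Please reserve me a seat on that train.", ("RT",)),
]

# Number of utterances devoted to each attribute (a simple efficiency cost).
utterances_per_attribute = Counter(
    tag for _, _, tags in DIALOGUE for tag in tags
)

# Consecutive utterances with the same tags form one subdialogue segment.
segments = [
    (tags, [turn for turn, _, _ in group])
    for tags, group in groupby(DIALOGUE, key=lambda utterance: utterance[2])
]

print(utterances_per_attribute["DR"], segments)
```

Here the clarification questions U2 and U3 fall into the same segment as C1 because they carry the DR tag, which is what allows their cost to be attributed to the depart-range subtask.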

Second, consider the very different domain and task
of diagnosing a fault and repairing a circuit (Smith and
Gordon, 1997). Figure 6 presents one dialogue from this
domain.



 1. C: This is the circuit fix it shop.               ID,CB,RB,FT,FC,T
       How may I help you?
 2. U: I want to fix a circuit.                       ID,CB,RB,FT,FC,T
 3. C: What is the ID of the circuit?                 ID
 4. U: RS111.                                         ID
 5. C: I am familiar with that circuit.               CB
       The LED is supposed to be displaying
       alternately flashing one and seven.
 6. U: The LED is off.                                RB
 7. C: What is the switch at when the LED is off?     RB
 8. U: The switch is down.                            RB
 9. C: Put the switch up.                             RB
10. U: Okay.                                          RB
11. C: Put the knob to one zero.                      RB
12. U: Okay.                                          RB
13. C: What is the LED displaying?                    RB
14. U: LED is displaying only a flashing seven.       RB
15. C: Is there a wire between connector eight four   FT
       and connector nine nine?
16. U: No.                                            FT
17. C: Add a wire between connector eight four        FC
       and connector nine nine.
18. U: Done.                                          FC
19. C: What is the LED displaying?                    T
20. U: Alternately flashing one and seven.            T
21. C: Is the one on the LED displaying for           T
       a longer period of time?
22. U: No.                                            T
23. C: Is the seven on the LED displaying             T
       for a longer period of time?
24. U: No.                                            T
25. C: Put the knob to zero.                          T
26. U: Okay.                                          T
27. C: What is the LED displaying?                    T
28. U: Alternately displaying one and seven.          T
29. C: Is the one on the LED displaying               T
       for a longer period of time?
30. U: Yes.                                           T
31. C: Put the switch down.                           T
32. U: The switch is down.                            T
33. C: What is the LED displaying?                    T
34. U: Nothing.                                       T
35. C: The circuit is working correctly. Good-bye.    T

Figure 6: A circuit domain dialogue (Smith and Gordon,
1997), with AVM tagging

Smith and Gordon collected 144 dialogues for this
task, in which agent initiative was varied by using different
dialogue strategies, and tagged each dialogue according
to the following subtask structure:

• Introduction (I) — establish the purpose of the task 

• Assessment (A) — establish the current behavior 

• Diagnosis (D) — establish the cause for the errant 
behavior 

• Repair (R) — establish that the correction for the er- 
rant behavior has been made 

• Test (T) — establish that the behavior is now correct 

Our informational analysis of this task results in the 
AVM shown in Table 7. Note that the attributes are al- 
most identical to Smith and Gordon's list of subtasks. 
Circuit-ID corresponds to Introduction, Correct-Circuit- 
Behavior and Current-Circuit-Behavior correspond to 
Assessment, Fault-Type corresponds to Diagnosis, Fault- 
Correction corresponds to Repair, and Test corresponds 
to Test. The attribute names emphasize information ex- 
change, while the subtask names emphasize function. 



attribute                        possible values
Circuit-ID (ID)                  RS111, RS112, ...
Correct-Circuit-Behavior (CB)    Flash-1-7, Flash-1, ...
Current-Circuit-Behavior (RB)    Flash-7
Fault-Type (FT)                  MissingWire84-99, MissingWire88-99, ...
Fault-Correction (FC)            yes, no
Test (T)                         yes, no

Table 7: Attribute value matrix, circuit domain

Figure 6 is tagged with the attributes from Table 7. 
Smith and Gordon's tagging of this dialogue according 
to their subtask representation was as follows: turns 1- 
4 were I, turns 5-14 were A, turns 15-16 were D, turns 
17-18 were R, and turns 19-35 were T. Note that there 
are only two differences between the dialogue structures 
yielded by the two tagging schemes. First, in our scheme 
(Figure 6), the greetings (turns 1 and 2) are tagged with 
all the attributes. Second, Smith and Gordon's single tag 
A corresponds to two attribute tags in Table 7, which in 
our scheme defines an extra level of structure within as- 
sessment subdialogues. 
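
The comparison between the two tagging schemes can be checked mechanically. The sketch below is an illustration under stated assumptions: the per-turn AVM tags are read off Figure 6, the attribute-to-subtask correspondence is the one given in the text, and the helper `expand` and the variable names are hypothetical.

```python
# Correspondence between Table 7 attributes and Smith and Gordon's subtasks.
ATTRIBUTE_TO_SUBTASK = {"ID": "I", "CB": "A", "RB": "A", "FT": "D", "FC": "R", "T": "T"}
ALL_ATTRIBUTES = tuple(ATTRIBUTE_TO_SUBTASK)


def expand(ranges):
    """Turn {(first, last): label} into a per-turn {turn: label} mapping."""
    return {
        turn: label
        for (first, last), label in ranges.items()
        for turn in range(first, last + 1)
    }


# Smith and Gordon's subtask tagging of the Figure 6 dialogue.
SUBTASK_TAGS = expand({(1, 4): "I", (5, 14): "A", (15, 16): "D", (17, 18): "R", (19, 35): "T"})

# AVM attribute tagging of the same dialogue, as read off Figure 6.
AVM_TAGS = expand({
    (1, 2): ALL_ATTRIBUTES, (3, 4): ("ID",), (5, 5): ("CB",), (6, 14): ("RB",),
    (15, 16): ("FT",), (17, 18): ("FC",), (19, 35): ("T",),
})

# Turns where the AVM tags do not map onto exactly the subtask tag.
differing_turns = sorted(
    turn
    for turn in SUBTASK_TAGS
    if {ATTRIBUTE_TO_SUBTASK[a] for a in AVM_TAGS[turn]} != {SUBTASK_TAGS[turn]}
)

# Subtasks that the AVM scheme splits into more than one attribute.
attributes_by_subtask = {}
for turn, attributes in AVM_TAGS.items():
    if len(attributes) == 1:
        attributes_by_subtask.setdefault(SUBTASK_TAGS[turn], set()).add(attributes[0])
refined_subtasks = {s: a for s, a in attributes_by_subtask.items() if len(a) > 1}

print(differing_turns, refined_subtasks)
```

The two results correspond to the two differences noted above: the greeting turns carry all attributes, and the single Assessment tag is refined into the CB and RB attributes.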

Smith and Gordon report a reliability score of .82 for their
tagging scheme.

4 Discussion

This paper presented the PARADISE framework for
evaluating spoken dialogue agents. PARADISE is a general
framework for evaluating spoken dialogue agents
that integrates and enhances previous work. PARADISE
supports comparisons among dialogue strategies with a
task representation that decouples what an agent needs
to achieve in terms of the task requirements from how
the agent carries out the task via dialogue. Furthermore,
this task representation supports the calculation of
performance over subdialogues as well as whole dialogues.
In addition, because PARADISE's success measure
normalizes for task complexity, it provides a basis for
comparing agents performing different tasks.

The PARADISE performance measure is a function of
both task success (κ) and dialogue costs (c_i), and has
a number of advantages. First, it allows us to evaluate
performance at any level of a dialogue, since κ and c_i
can be calculated for any dialogue subtask. Since
performance can be measured over any subtask, and since
dialogue strategies can range over subdialogues or the
whole dialogue, we can associate performance with
individual dialogue strategies. Second, because our success
measure κ takes into account the complexity of the task,
comparisons can be made across dialogue tasks. Third,
κ allows us to measure partial success at achieving the
task. Fourth, performance can combine both objective
and subjective cost measures, and specifies how to
evaluate the relative contributions of those cost factors to
overall performance. Finally, to our knowledge, we are
the first to propose using user satisfaction to determine
weights on factors related to performance.

In addition, this approach is broadly integrative,
incorporating aspects of transaction success, concept
accuracy, multiple cost measures, and user satisfaction. In our
framework, transaction success is reflected in κ,
corresponding to dialogues with a P(A) of 1. Our performance
measure also captures information similar to concept
accuracy, where low concept accuracy scores translate into
either higher costs for acquiring information from the
user, or lower κ scores.

One limitation of the PARADISE approach is that the
task-based success measure does not reflect that some
solutions might be better than others. For example, in
the train timetable domain, we might like our task-based
success measure to give higher ratings to agents that
suggest express over local trains, or that provide helpful
information that was not explicitly requested, especially
since the better solutions might occur in dialogues with
higher costs. It might be possible to address this
limitation by using the interval scaled data version of κ
(Krippendorf, 1980). Another possibility is to simply
substitute a domain-specific task-based success measure in the
performance model for κ.

The evaluation model presented here has many
applications in spoken dialogue processing. We believe
that the framework is also applicable to other
dialogue modalities, and to human-human task-oriented
dialogues. In addition, while there are many proposals in the
literature for algorithms for dialogue strategies that are
cooperative, collaborative or helpful to the user,
very few of these strategies have been evaluated as to
whether they improve any measurable aspect of a dialogue
interaction. As we have demonstrated here, any dialogue
strategy can be evaluated, so it should be possible to
show that a cooperative response, or other cooperative
strategy, actually improves task performance by reducing
costs or increasing task success. We hope that this
framework will be broadly applied in future dialogue
research.